OutcropHyBNet: Hybrid Backbone Networks with Data Augmentation for Accurate Stratum Semantic Segmentation of Monocular Outcrop Images in Carbon Capture and Storage Applications

The rapid advancement of climate change and global warming have widespread impacts on society, including ecosystems, water security, food production, health, and infrastructure. To achieve significant global emission reductions, approximately 74% is expected to come from cutting carbon dioxide (CO2) emissions in energy supply and demand. Carbon Capture and Storage (CCS) has attained global recognition as a preeminent approach for the mitigation of atmospheric carbon dioxide levels, primarily by means of capturing and storing CO2 emissions originating from fossil fuel systems. Currently, geological models for storage location determination in CCS rely on limited sampling data from borehole surveys, which poses accuracy challenges. To tackle this challenge, our research project focuses on analyzing exposed rock formations, known as outcrops, with the goal of identifying the most effective backbone networks for classifying various strata types in outcrop images. We leverage deep learning-based outcrop semantic segmentation techniques using hybrid backbone networks, named OutcropHyBNet, to achieve accurate and efficient lithological classification, while considering texture features and without compromising computational efficiency. We conducted accuracy comparisons using publicly available benchmark datasets, as well as an original dataset expanded through random sampling of 13 outcrop images obtained using a stationary camera, installed on the ground. Additionally, we evaluated the efficacy of data augmentation through image synthesis using Only Adversarial Supervision for Semantic Image Synthesis (OASIS). Evaluation experiments on two public benchmark datasets revealed insights into the classification characteristics of different classes. The results demonstrate the superiority of Convolutional Neural Networks (CNNs), specifically DeepLabv3, and Vision Transformers (ViTs), particularly SegFormer, under specific conditions. These findings contribute to advancing accurate lithological classification in geological studies using deep learning methodologies. In the evaluation experiments conducted on ground-level images obtained using a stationary camera and aerial images captured using a drone, we successfully demonstrated the superior performance of SegFormer across all categories.


Introduction
The escalating global phenomenon of climate change, resulting from the warming of the earth, has reached a level of utmost urgency.In its Second Working Group Report of the Sixth Assessment Report [1], the Intergovernmental Panel on Climate Change (IPCC) highlighted the profound impact of climate change on various human systems, including ecosystems, water security, food production, health and well-being, cities, residences, and infrastructure.According to the First Working Group Report, the global average temperature from 2011 to 2020 has risen by 1.09 °C compared to the pre-industrial era.Furthermore, the IPCC announced that even under scenarios with extremely low greenhouse gas emissions, such as achieving zero carbon dioxide (CO 2 ) emissions by around 2050 or later and subsequent negative emissions, there is a possibility of global temperature increase reaching 1.5 °C between 2021 and 2040.The report also indicated that the frequency of extreme temperature events in terrestrial areas that occur once every 10 years or once every 50 years is likely to increase by 4.1 times and 8.6 times, respectively, due to 1.5 °C of warming.In addition to the projected increase of 1.5-fold in decadal events for heavy rainfall in terrestrial areas and a 2.0-fold increase in agricultural and ecological droughts in arid regions, it is anticipated that severe snowstorms and super typhoons will undergo further intensification [2].
The First Working Group Report revealed a nearly linear relationship between cumulative CO 2 emissions and the phenomenon of global warming.To limit the temperature increase beyond the pre-industrial levels to 1.5 °C with a probability of 67% or higher, it was estimated that the remaining CO 2 emissions should not exceed 400 billion tons.The Third Working Group Report stated that in scenarios where global CO 2 emissions reach zero, approximately 74% of the required global emissions reduction would be achieved through reductions in CO 2 emissions from energy supply and demand.While renewable energy has emerged as a prominent solution, it is recognized that a combination of renewable energy sources and fossil fuel systems is still necessary to meet the current energy demand.In light of this, the present study specifically focuses on carbon capture and storage (CCS) technology [3], assuming a high carbon capture rate of 90-95% from fossil fuel systems.
Achieving carbon neutrality requires a balance between emissions and removals of greenhouse gases [4].However, in many sectors, complete decarbonization is proving to be a challenging reality.A prime example of this challenge is the power generation sector.In this context, CCS technology plays an indispensable role in effectively reducing CO 2 emissions and achieving the goal of carbon neutrality.The automotive industry is also progressing towards decarbonization.The transition from internal combustion engines to electric motors has led to a reduction in CO 2 emissions.However, charging the batteries of electric vehicles requires a substantial amount of electricity, and relying solely on renewable energy sources to meet this demand presents a formidable challenge.Nuclear power as an alternative energy source remains a subject of debate, and its utilization presents significant challenges.These challenges include the management of nuclear waste, the threat of terrorism, and the need to learn from past nuclear power plant accidents while undertaking long-term decommissioning processes, which are complex and require careful consideration.To achieve carbon neutrality, a diverse array of strategies and approaches is imperative.
CCS refers to the collective techniques of capturing carbon dioxide emitted from factories, power plants, and other sources and storing it underground before its release into the atmosphere [5].The selection of storage locations is based on geological models derived from borehole surveys and probability statistics.However, the current statistical methods used to create geological models from limited sampling information obtained through borehole drilling present challenges in terms of accuracy.By obtaining investigations of the entire geological formation, there is a possibility to construct a geological model that is more precise and accurate.In our research project [6], we specifically focus on outcrops, which are exposed parts of geological formations visible on the Earth's surface and are covered by surface soil and vegetation.By analyzing images of outcrops, we aim to identify optimal locations for storage by creating high-precision geological models.Therefore, this study aims to explore the optimal backbone for semantic segmentation of outcrop images using deep learning techniques, taking into consideration both the latest advancements in the field and computational efficiency to minimize processing time.
In this study, the primary focus was on the examination of the outcrop shown in the photograph presented in Figure 1.This paper presents our research efforts in developing a precise and efficient methodology for the classification of geological formations in outcrop images.Our approach leverages deep learning-based semantic segmentation techniques for this purpose, aiming to achieve accurate and reliable results.We investigate various backbone architectures to determine the most suitable approach for this task.By accurately characterizing geological formations, our proposed methodology can contribute to identifying optimal locations for CCS and promoting effective carbon sequestration, which is crucial for mitigating the impact of climate change.

Related Studies
The field of computer vision has extensively utilized segmentation techniques for pixel-wise object classification in images.Segmentation serves as a fundamental technology in various practical tasks, including autonomous driving and other computer vision applications [7].Until recently, Convolutional Neural Networks (CNNs) [8] have been the predominant approach for segmentation tasks in computer vision, following the introduction of Fully Convolutional Networks (FCN) [9] and their subsequent advancements.In 2020, Vision Transformer (ViT) [10] was introduced as an architecture that adapted the successful Transformer model [11] from natural language processing to image recognition.ViT surpassed the state-of-the-art BiT (Big Transfer) [12] method in terms of accuracy, leading to a surge of research utilizing ViT in the field of image segmentation.
As a transformer-based architecture for image recognition, ViT builds upon the Transformer model that revolutionized natural language processing tasks.By omitting convolutional operations, ViT achieves improved computational efficiency and scalability.Unlike CNNs, Transformers lack the inherent inductive bias that considers the proximity of information in convolutional layers as relevant, necessitating a large amount of training data for generalization.Geirhos et al. [13] highlighted the classification characteristics of CNNs, which prioritize textures over object shapes, as differing from human perception.Conversely, Tuli et al. [14] revealed that ViT's classification characteristics are biased towards object shapes and align more closely with human perception.
In recent years, there has been a growing interest in investigating not only CNNs and ViTs as individual backbone architectures but also hybrid backbone architectures that combine both approaches.Moreover, there has been a resurgence of interest in employing a simplistic backbone architecture that solely comprises Multi-Layer Perceptron (MLP) networks [15].MLPs are feedforward neural networks composed of multiple layers of interconnected perceptrons, each consisting of weighted inputs, an activation function, and a bias term.MLPs can be seen as a deep extension of traditional neural network architectures, where the concept of depth refers to the increased number of layers in the network.By stacking multiple layers, MLPs can capture hierarchical representations of input data, enabling them to learn intricate and abstract features through their deep structure.This deep layering allows MLPs to effectively model complex relationships and patterns in the input data, making them powerful tools in various machine learning tasks.While MLP-based backbones demonstrate performance comparable to BiT and ViT in classification tasks, their applicability to segmentation tasks has not been fully explored.
Traditionally, high-precision 3D models of geological strata were created using groundbased laser scanning methods [16].However, this approach has limitations such as the weight of surveying equipment, the need for scans from multiple field-based positions, and the time-consuming nature of data acquisition.Consequently, the modeling process presents significant challenges in regions where conducting in situ measurements and data collection is impractical or entails risks.In such cases, drones equipped with cameras are being utilized.Researchers such as Corradetti et al. [17] employed drones to capture photographs of cliffs composed of nearly vertical outcrops, creating 3D models that were used for crack analysis and understanding crack propagation patterns.Similarly, Sharad et al. [18] used drones to capture high-resolution images of complex and hazardous landslides, generating cm-level accuracy 3D models.Javier et al. [19] employed drones to create highly accurate and high-resolution 3D models for identifying and interpreting ancient Roman gold mining sites in Northwestern Spain, revealing areas such as excavation sites, canals, reservoirs, and drainage channels.Mirkes et al. [20] proposed a semantic segmentation method for rock outcrops that leads to the detection and segmentation of various geometric features, including fractures, faults, and sedimentary layers.Zhang et al. [21] state that most existing semantic segmentation methods are based on FCNs, which replace the fully connected layer with fully convolutional layers for pixel-level prediction.Malik et al. [22] proposed a segmentation method using a model that combines U-Net [23] and LinkNet [24] to classify three classes: background, sandstone, and mudstone.They conducted an evaluation experiment on a self-collected dataset of 102 images from a field in Brunei Darussalam, demonstrating higher accuracy compared to conventional methods.However, their proposed method and the comparison methods were based on conventional CNN-based backbones, without considering recent advancements in deep learning.Vasuki et al. [25] proposed an interactive segmentation method primarily using edge features extracted from rock images obtained using a drone.They focused on superpixels as the minimum resolution, emphasizing geological analysis and engaging in image sensing and analysis.However, their segmentation method relied on conventional image processing methods for feature extraction and did not incorporate learning-based techniques, which might not provide sufficient accuracy and generalizability for this application domain.
Although research utilizing drones in geosciences has gained momentum [26], most studies focus on analyzing topography using 3D models generated from captured photographs.However, there are a limited number of reported studies [22,25] that apply segmentation-based approaches to classify lithostratigraphy in geological outcrops.This paper emphasizes the mounting interest in employing CNNs, ViT, and investigating hybrid backbone architectures for segmentation tasks, while also acknowledging the expanding utilization of drones in geological studies.Furthermore, it identifies a research gap concerning the application of segmentation-based approaches to the lithostratigraphic classification of geological outcrops.

OutcropHyBNet
We propose a novel approach named OutcropHyBNet, which combines a state-of-theart CNN architecture, DeepLabv3+ [27], and a transformer-based vision model, SegFormer, to tackle the task of stratum semantic segmentation in outcrop images.The overall architecture of our proposed method is illustrated in Figure 2. OutcropHyBNet leverages the robust segmentation capabilities of DeepLabv3+ and the expressive power of SegFormer [28] as the backbone networks for accurate and efficient stratum segmentation.To enhance the diversity of training data, we employ Only Adversarial Supervision for Semantic Image Synthesis (OASIS) [29] in image synthesis.During the segmentation training process, our dataset includes both original outcrop images and synthetic images generated using OASIS.OASIS utilizes the power of generative models to produce synthetic outcrop images that manifest characteristics closely resembling those observed in real-world data.By incorporating OASIS-generated images into the training dataset, we expand the available data and improve the model capability to handle various outcrop images.The OutcropHyBNet architecture is designed to harness the power of CNN and ViT backbones for accurate semantic segmentation of outcrop images.The input images are processed through both backbones, allowing for efficient feature extraction and comprehensive contextual understanding.The extracted features are further processed by additional layers to perform pixel-wise classification, resulting in the generation of high-quality segmentation maps.As the baseline model for OutcropHyBNet, we integrate DeepLabv3+ and SegFormer into the architecture.Herein, SegFormer is one of the state-of-the-art semantic segmentation models that adopts a transformer-based architecture [30].By leveraging the capabilities of SegFormer, we aim to improve the accuracy and performance of outcrop image segmentation in our proposed method.By contrast, DeepLabv3+ is a lightweight model that exhibits superiority in stuff classification.Although the ViT has gained significant attention in the field of computer vision, CNNs still demonstrate strong potential in segmentation tasks, particularly in areas involving texture and stuff [31].For this mechanism, OutcropHyBNet can flexibly utilize both backbones based on the segmentation target.

Data Augmentation with GANs
Generative Adversarial Networks (GANs) [32] are a generative model based on adversarial training without extensively annotated training data [33].GANs offer a technique for generating realistic data, such as images, from random noise.In our previous study [34], we demonstrated the power and effectiveness of image synthesis for semantic segmentation applications in agriculture.
The network architecture of GANs consists of two main components: a generator G and a discriminator D. The G is responsible for generating synthetic images, while the role of D is to distinguish between real images from a dataset and fake images generated by G.The G aims to deceive D by generating images that closely resemble real ones, while D strives to accurately classify the input images as real or fake.Both G and D networks are trained adversarially and simultaneously.The training process involves iteratively updating the networks in an attempt to achieve a dynamic equilibrium, where G becomes increasingly proficient at generating realistic images, and D becomes increasingly adept at discriminating between real and fake images.The G receives random noise as input and transforms it into synthesized images.The D, on the other hand, receives either real images from a dataset or generated images from G as input and outputs a probability score indicating the likelihood of the input being real.
By optimizing the respective objectives of G and D through backpropagation and gradient descent, GANs learn to generate high-quality synthetic data that closely resembles the real data distribution.Since their introduction, GANs have undergone significant improvements and spawned various derivative models.These improvements have expanded the capabilities of GANs and paved the way for extensive research in the realm of semantic image synthesis.In this study, we introduce OASIS [29], a novel generative model based on GANs, which harnesses the power of the adversarial training paradigm to synthesize images with desired semantic content.

OASIS
In recent years, research on data generation has gained significant momentum, driven by the introduction of diffusion models (DMs) [35].Although DMs have demonstrated their efficacy in various vision tasks [36], they often require substantial computational resources and impose a heavy memory burden.In this study, prioritizing ease of implementation and computational efficiency, we have selected OASIS [29] as our model of choice, which is based on the GAN framework.
To generate high-quality images that align with the input semantic label map, G requires D, which can effectively capture semantic features at various resolutions.In the OASIS framework, the role of D is structured as a multi-class segmentation task.The architecture adopted in D is an encoder-decoder network, specifically based on the U-Net [23] with skip connections.The segmentation task for D aims to predict per-pixel class labels for real images, considering the given semantic label map as the ground truth.In addition to the N semantic classes obtained from the label map, all pixels of the synthesized images are classified as an additional class.Therefore, the formulated segmentation task involves N + 1 classes, and OASIS employs a cross-entropy loss with N + 1 classes for training.
As the segmentation task deals with class imbalance due to varying class frequencies, there is a possibility that the performance may be hindered.To mitigate this issue, OASIS leverages pixel-level loss calculation in D. Specifically, each class is weighted inversely proportional to the frequency of occurrence at the pixel level within a batch.This weighting scheme assigns higher weights to classes with lower frequencies, aiming to alleviate the impact of class imbalance and improve accuracy for classes with low occurrence.As a result, the contribution of each class to the loss is normalized, leading to improved accuracy for classes with low occurrence.The loss L D of the updated D is formulated as follows: where x represents real images, H and W represent the image height and width, (z, t) is the combination of noise and label map used by G to produce synthesized images, and D maps real or synthesized images to per-pixel (N + 1)-class prediction probabilities.Here, E denotes a unit vector in a normed vector space.The ground truth label t is a 3D tensor, where the first two dimensions correspond to spatial positions (i, j) ∈ H × W, and the third dimension encodes the class c ∈ 1, . . ., N + 1 as a one-hot vector.When designing G to align with D, the loss function for G is expressed as following.
To enable multi-modal synthesis through noise sampling, G is designed to synthesize diverse outputs from input noise.Hence, a noise tensor of size M × H × W is constructed to match the spatial dimensions of the N × H × W label map, where N represents the number of semantic classes, and M corresponds to the number of masks.During training, the 3D noise tensor is sampled channel-wise and fed to each pixel of the image.After sampling, the noise and label maps are concatenated along the channel dimension, forming a (M + N) × H × W noise-label concatenation 3D tensor.This concatenation tensor serves as input to the first generation layer and spatially adaptive normalization layers of each generation block.The 3D noise has sensitivity at the channel and pixel levels, allowing for specific object-level image generation by sampling noise locally for each channel, label, or pixel during testing.

DeepLabv3+
For pixel-level image segmentation, DeepLabv3+ [27] represents a significant advancement within the renowned DeepLab model family [37].This architecture has been specifically designed to excel in the task of precise and detailed segmentation, offering exceptional performance and accuracy.By leveraging advanced techniques and innovations, DeepLabv3+ pushes the boundaries of pixel-level image segmentation and stands as a testament to the ongoing progress within the DeepLab model family.DeepLabv3+ has garnered significant acclaim for their remarkable prowess in achieving precise and efficient semantic image segmentation.With its enhanced architecture and refined techniques, DeepLabv3+ builds upon the foundation established by its predecessors, pushing the boundaries of segmentation capabilities even further.Moreover, DeepLabv3+ has achieved outstanding performance on various benchmark datasets, surpassing previous state-of-the-art methods in terms of accuracy and computational efficiency.Its ability to capture contextual information at multiple scales and preserve fine details has made it particularly effective in tasks such as object recognition, scene understanding, and medical image analysis.
The architecture of DeepLabv3+ builds upon the strengths of its predecessors by incorporating an encoder-decoder structure along with atrous convolutions [38] and atrous spatial pyramid pooling (ASPP) modules [27].The encoder network, usually based on pre-trained CNNs such as ResNet [39] or Xception [40], extracts high-level features from the input image while preserving spatial information.The atrous convolutions enable the network to capture multi-scale contextual information without significantly increasing the computational cost.The decoder network employs bilinear upsampling to restore the spatial resolution of the features obtained from the encoder.Additionally, skip connections from earlier layers are incorporated to ensure that fine-grained details are preserved in the final segmentation.The ASPP module further enhances the receptive field of the network by applying atrous convolutions at multiple dilation rates and capturing contextual information at different scales.

SegFormer
SegFormer [28] adopts a ViT-based methodology, leveraging its distinctive Mix Transformer (MiT) encoder.The MiT encoder consists of a hierarchical Transformer, overlapped patch merging, efficient self-attention, and Mix-FFN.These components collectively contribute to the effectiveness and efficiency of the SegFormer model for segmentation tasks.Unlike ViT, which can only generate feature maps at a single resolution, the hierarchical Transformer in SegFormer produces multi-level feature maps.These maps provide both high-resolution coarse features and low-resolution fine-grained details, contributing to improved segmentation accuracy.
ViT incorporates Positional Encoding (PE) to capture positional information.However, the resolution of PE is fixed.As a result, when the resolution differs between training and testing, the accuracy may deteriorate.To address this issue, a Mix-FFN is introduced, which applies a 3 × 3 convolutional layer directly to the feed-forward network (FFN).
SegFormer adopts a lightweight decoder consisting solely of MLP layers, known as the All-MLP decoder.This avoids the computationally expensive configurations used in other methods.The hierarchical Transformer encoder in SegFormer enables this simple decoder by having a larger effective receptive field (ERF) compared to the encoder of traditional CNNs.

Cross-Entropy Loss
To train DeepLabv3+ and SegFormer, a large-scale dataset annotated with pixel-level labels is required.Typically, the network is trained in a supervised manner using a crossentropy loss function L CE given by: where N is the number of pixels; C is the number of classes; y ij represents the ground truth label for pixel i and class j; and p ij is the predicted probability of pixel i belonging to class j.

Evaluation Criteria
In this study, we employ the Fréchet Inception Distance (FID) [41,42] as a formal evaluation criterion.By incorporating information about the underlying distributions and the representation of features, the FID metric provides a comprehensive assessment that captures the fidelity and resemblance of the generated samples to the real data.The FID metric utilizes a pre-trained Inception network [43] that has been trained on the ImageNet dataset [44].The pre-training on ImageNet helps capture general visual features and enables transfer learning, where the learned representations are fine-tuned for specific tasks [45].By leveraging the representation power of the Inception network, FID provides a quantitative measure of the quality and diversity of generated images compared to the real image distribution.FID calculates the distance between the feature vectors extracted from the real images and the generated images, quantitatively evaluating the similarity between the two.The FID is defined as follows: where m w and m represent the means of the feature vectors extracted from the generated images and the real images, respectively.C w and C represent the covariance matrices of the feature vectors.Subsequently, to assess the quality of segmentation, Intersection over Union (IoU) is employed as the evaluation metric in this study.IoU represents the degree of intersection between the predicted region and the ground truth region, and mIoU represents the average IoU across all classes.IoU is calculated using the following equation: Herein , True Positive (TP) corresponds to the instances where both the prediction and the class are true.False Positive (FP) represents the instances where the prediction is false, but the class is true.False Negative (FN) denotes the instances where the prediction is true, but the class is false.

Preliminary Performance Evaluation with Benchmark Datasets 4.1. Data Profiles and Setups
We evaluated the performance of the proposed method, OutcropHyBNet, in a general context using two benchmark datasets: COCO-Stuff10K [46] and ADE20K [47].These datasets encompass diverse scenes and objects commonly encountered in everyday environments, facilitating a comprehensive evaluation of the proposed method's performance in real-world scenarios.
The COCO-Stuff10K serves as an extensively utilized benchmark dataset for tasks related to scene understanding and segmentation.Comprising 10,000 high-resolution images, this dataset features pixel-wise annotations.The images within this dataset exhibit diverse resolutions, ranging from 480 × 640 to 960 × 1280 pixels, while maintaining an aspect ratio of 3:4.The dataset provides comprehensive annotations for both objects and stuff categories.It includes 80 object categories, such as person, car, and dog, and 91 stuff categories, such as sky, grass, and road.The pixel-level annotations enable detailed semantic segmentation of scenes, facilitating the evaluation and development of advanced computer vision algorithms.Moreover, the dataset provides a wide array of visual scenes, encompassing a comprehensive spectrum of both indoor and outdoor environments.It serves as a standard benchmark for evaluating and contrasting the performance of semantic segmentation models.
The ADE20K dataset is a widely used dataset for semantic segmentation tasks.It comprises more than 20,000 high-resolution images, specifically 150 objects and 50 stuff categories.All images in the dataset have a fixed resolution of 512 × 512 pixels.The ADE20K dataset provides pixel-level annotations for both objects and stuff categories, enabling fine-grained semantic segmentation.It covers a diverse range of scenes, including indoor and outdoor environments, and captures various objects and stuff categories commonly encountered in everyday life.Moreover, the ADE20K dataset is designed to facilitate research and development in scene parsing and semantic understanding.It serves as a benchmark for evaluating the performance of semantic segmentation models and has been widely adopted in the computer vision community.The inclusion of this dataset allows for a comprehensive assessment of the generalization capability of the proposed method, OutcropHyBNet.

Experimental Setup
For this study, we utilized MMSegmentation [48], an open-source segmentation toolbox developed by OpenMMLab, as the designated implementation platform.MMSegmentation offers a comprehensive and versatile solution specifically tailored for semantic segmentation tasks.Its open-source nature and seamless integration with PyTorch provide us with a valuable resource for conducting our evaluation experiments.One of the key strengths of MMSegmentation lies in its wide array of segmentation models, catering to diverse requirements in the field.This rich collection of models establishes MMSegmentation as an invaluable asset for various developers.With its extensive toolkit, we can effectively address various segmentation tasks and explore different approaches, thereby enhancing the depth and breadth of our research and practical applications.
The computation for this study was carried out on a single NVIDIA RTX A6000 GPU.Renowned as a high-performance GPU, the A6000 is purpose-built to tackle professional workloads in various fields, including data science, deep learning, AI research, and content creation.Its exceptional capabilities make it an ideal choice for handling the intensive computational tasks.The A6000 has 10752 CUDA cores, 48 GB of GDDR6 memory, and a memory bandwidth of 768 GB/s.With its powerful architecture, it delivers exceptional performance for tasks such as deep learning training, real-time ray tracing, and high-resolution rendering.The parameters for each method were determined using the configuration file of the pretrained model that achieved the highest accuracy on the ADE20K dataset, which is provided by MMSegmentation [49].

Class Balancing for Uneven Data
To mitigate the challenge of class imbalance [50], we employed class balancing techniques [51] as a simple and practical approach for data adjustment and enhancement.Let x represent the number of pixels in a class and y represent the total number of pixels excluding unlabeled pixels.The weight w is calculated using the following equations: where z represents the median of z.
Table 1 provides a comprehensive overview of the calculated weights, which were determined considering the pixel occupancy ratio.The weights were assigned in such a way that they decrease as the pixel occupancy ratio increases, and conversely, they increase as the pixel occupancy ratio decreases.The approach aims to effectively mitigate the issue of class imbalance by assigning higher weights to underrepresented classes with lower pixel occupancy ratios.This strategy ensures that these classes receive greater attention during the training process, thereby addressing their significance in a more comprehensive manner.By incorporating these calculated weights, we aim to achieve a more balanced and accurate model performance, ultimately improving the overall effectiveness of our approach in handling imbalanced datasets.We evaluate the performance of the models using the Intersection over Union (IoU) metric for each class.Table 2 presents the comparison of class balancing results for both DeepLabv3+ and SegFormer models.From the results, we observe that class balancing has a significant impact on the performance of both models.SegFormer shows improvements in most classes, except for the Black class.The decrease in performance for the Black class in SegFormer can be attributed to a specific image (Image 10), where the IoU is significantly lower compared to other images.The IoU for the Black class in DeepLabv3+ remains relatively stable across all images.

Data Augmentation
Our proposed approach, OutcropHyBNet, utilizes OASIS to generate images and augment the dataset.By leveraging OASIS-generated images, we expand the breadth and depth of our dataset, enabling a more comprehensive representation of geological features and variations.Integrating OASIS into our methodology addresses the challenge of limited real-world outcrop data and enriches the learning process of OutcropHyBNet.The combination of synthetic and real data enhances the model's capacity to accurately analyze and interpret geological formations with improved precision and reliability.
Table 3 presents the parameters used for this purpose.In this experiment, a dataset with a sampling number of 256 images was utilized, and the same dataset was used for both training and testing.DeepLabv3+ and SegFormer were used as the comparative methods.A total of 3661 images were used for evaluation, which consisted of 333 images generated using OASIS and 256 × 13 images from the dataset used in the full image experiment.The training and testing data were randomly allocated in a 9:1 ratio, resulting in 3294 images for training and 367 images for testing.Table 4 shows the results of class balancing.For the Black class, both methods improved accuracy in 4 out of 6 images.For the Red class, DeepLabv3+ improved accuracy in 10 out of 13 images, while SegFormer improved accuracy in 8 images.Similarly, for the Cyan class, DeepLabv3+ improved accuracy in 11 out of 13 images, and SegFormer improved accuracy in 8 images.Regarding the Yellow class, DeepLabv3+ improved accuracy in 6 out of 13 images, while SegFormer improved accuracy in 4 images.In terms of mIoU, DeepLabv3+ improved accuracy in 10 out of 13 images, and SegFormer improved accuracy in 6 images.It is worth noting that the Yellow class exhibited a decrease in accuracy in more than half of the images for both methods.This can be attributed to the initial weight of 0.3792, which is significantly lower compared to the absence of class balancing.Table 5 demonstrates the improved mIoU scores achieved by incorporating OASISgenerated images.These images significantly enhance the accuracy of segmenting and classifying geological formations in our proposed approach.Both segmentation methods showed improved accuracy, denoted ∆, for all classes compared to the dataset before augmentation.Particularly, they achieved an accuracy improvement of over 3% for the Cyan class.Therefore, it can be concluded that dataset augmentation using OASIS for data generation contributes to the improvement in accuracy.Furthermore, the consistent trend of CNNs backbones outperforming ViT backbones was observed throughout the evaluation.

Selection of Backbones
To verify the effectiveness of the proposed approach, a preliminary experiment was conducted for performance comparison using seven different network models with varying backbones.The backbones used for comparison were ResNet [39], HRNet [52], U-Net [23], Swin Transformer [53], MiT [28], ViT [10], and SVT [54].Table 6 presents the specific parameter configurations for each backbone utilized in this experiment.The common parameters included a batch size of 8, a class count of 4, 4 sampling patterns (64, 128, 256, and 512 images), an input image size of 256×256 pixels, and a training epoch set to 50.Regarding the input data, a random sampling was performed on 13 images, allocating them to training and testing data in a 9:1 ratio.[53,56] Swin-L 640 × 640 6.0 × 10 −5 5.0 × 10 −4 SETR [57] ViT-L 512 × 512 1.0 × 10 −3 0.0 Twins [54] SVT-L 512 × 512 6.0 × 10 −5 1.0 × 10 −2 SegFormer MiT-B5 640 × 640 6.0 × 10 −5 1.0 × 10 −2 The left panel of Figure 3 illustrates the accuracy of CNN-based methods [37].DeepLa-bv3+ [27] consistently demonstrated the highest accuracy among all sampling numbers.Additionally, across all methods, the highest accuracy was achieved when the sampling number was 256 images.Subsequently, the right panel of Figure 3 presents the accuracy of ViT and hybrid-based methods [28,57].SegFormer consistently exhibited the highest accuracy across all sampling numbers.Moreover, excluding SEgmentation TRansformer (SETR) [57], SegFormer achieved the highest accuracy when the sampling number was 256 images.Figure 4 depicts the accuracy trends and distributions for all backbones and the top two models.The red lines correspond to ViT-based backbone [10], the green lines represent hybrid backbones [57], and the blue lines represent CNN-based backbones [38].The graph visually depicts how the accuracy of these methods varies across different experimental settings.Comparing the results, the methods can be ranked in terms of accuracy as follows: DeepLabv3+ [27], SegFormer [28], Twins [54], ResNet [39], and ViT [10].In other words, on the original dataset, CNNs outperformed ViT in terms of accuracy for this context.Table 7 presents mIoU of each class [27,28].Comparing the results, DeepLabv3+ demonstrated superiority for all classes except for the Black class at sampling numbers of 64.Additionally, in all sampling numbers except for 64 images, DeepLabv3+ outperformed SegFormer.Analyzing the mean scores, DeepLabv3+ consistently showed superior performance in all classes.Furthermore, in both methods, the classes with the highest accuracy were ranked as follows: Black, Yellow, Cyan, and Red.

Segmentation Results
Figure 5 shows the comparison results for all classes in each dataset.In ADE20K, DeepLabv3+ achieved an mIoU of 29.36%, while SegFormer achieved an mIoU of 41.38%.SegFormer demonstrated superiority in 154 out of 171 classes (90% of the total classes).In COCO-Stuff10K, DeepLabv3+ achieved an mIoU of 38.78%, while SegFormer achieved an mIoU of 48.40%.SegFormer exhibited superiority in 138 out of 150 classes (92% of the total classes, see Appendix A).
In COCO-Stuff10K, DeepLabv3+ outperformed SegFormer in terms of accuracy for certain classes.Among the things classes, DeepLabv3+ exhibited higher accuracy than SegFormer in four classes: surfboard, sports ball, car, and mouse, out of the 80 classes.In the stuff classes, DeepLabv3+ demonstrated higher accuracy in 13 classes: platform, mountain, stone, straw, bush, bridge, roof, house, cabinet, floor-other, float-wood, carpet, and wall-panel, out of the 91 classes.Conversely, SegFormer showed higher overall accuracy compared to DeepLabv3+ in both datasets.Similarly, in ADE20K, DeepLabv3+ surpassed SegFormer in accuracy for specific classes.Among the things classes, DeepLabv3+ achieved higher accuracy than SegFormer in 4 classes: railing, base, food, and monitor, out of the 115 classes.In the stuff classes, DeepLabv3+ demonstrated higher accuracy in 8 classes: house, river, skyscraper, hovel, path, tower, stairway, and pier, out of the 35 classes.Once again, SegFormer exhibited higher overall accuracy than DeepLabv3+ in ADE20K.In both datasets, the percentage of classes where DeepLabv3+ showed superiority was higher in the stuff classes compared to the things classes.This can be attributed to the fact that stuff classes lack well-defined boundaries, and the CNN-based architecture of DeepLabv3 utilized by DeepLabv3+ may have provided an advantage in texture classification, as mentioned earlier.Table 8 presents the top score classes observed in the COCO-Stuff10K dataset, while Table 9 showcases the top score classes identified in the ADE20K dataset.These tables provide a comprehensive overview of the most prominent classes present in each dataset, shedding light on the prevalent semantic categories and objects captured in the respective datasets.The identification and analysis of these top score classes contribute to a deeper understanding of the dataset composition and can inform the development of more effective models and algorithms for semantic segmentation and scene understanding tasks.
Focusing on the stuff classes, which are the classes of interest in this study, the top 10 classes combined for both methods include 3 classes (15%) in COCO-Stuff10K and 8 classes (40%) in ADE20K.On the other hand, the bottom 10 classes combined include 24 classes (75%) in COCO-Stuff10K and 9 classes (45%) in ADE20K.Therefore, it can be inferred that stuff classes have a lower representation in the top classes and a higher representation in the bottom classes.

Custom Dataset Profile
To assess the effectiveness of the proposed method, we employed two custom benchmark datasets: stationary camera-captured ground-level images and aerial images captured by drones.The stationary images dataset consists of a collection of images captured from the perspective of a person on the ground, with meticulous attention given to their inclusion and additional insights provided by domain experts.These images were taken using a Ricoh GR III camera, an off-the-shelf device widely recognized for its high-quality imaging capabilities.
The aerial images dataset comprises images captured from drones flying at varying altitudes.These images afford a bird's-eye view perspective, facilitating the analysis of expansive scenes and the capture of distinctive visual information.The dataset includes diverse landscapes, urban areas, and natural environments, enabling the evaluation of the proposed method's effectiveness in aerial image analysis tasks.Both datasets were carefully curated and annotated to provide ground truth labels for evaluation.The inclusion of these custom evaluation datasets allows for a thorough assessment of the proposed method's performance across different viewing angles and environments.

Stationary Ground-Level Images
Figure 6 presents the original images from our dataset, accompanied by their corresponding annotation images.The images were annotated by geological experts, who selectively cropped them to capture the regions of interest (RoI).Consequently, the image sizes exhibit variability due to the purposeful RoI extraction limited to the pertinent areas.Table 10 presents the relationship between geological lithology, grain size, grain sorting, and annotation colors: Yellow, Cyan, Red, and Black.Average grain size is shown on the Krumbein φ scale based on geological analysis.The degree of grain sorting depends on the particle size classification.Table 11 presents the resolutions in each image.Due to the burden of annotation, we clipped saliency partical images as RoIs because this is the standard way of annotation by geological experts.The burden is extremely high if the full sizes of images are set to annotation targets.Table 12 presents the pixel frequency for each class, providing a comprehensive overview of the distribution of pixels among different semantic classes.The presence of class imbalance within the dataset necessitates the implementation of class balancing techniques to ensure equitable representation and promote accurate model performance.Our custom dataset comprises outcrop images captured using a stationary camera.These images were manually annotated by domain experts specializing in geological analysis, using four labels.For the sake of convenience, unlabeled regions were assigned None, represented by the Green label.This labeling approach facilitates the handling of regions without specific semantic attributes.The semantic classes were allocated using a color scheme, with the unlabeled pixels represented by the color green, and the labeled pixels distributed among Black, Red, Cyan, and Yellow, resulting in a total of four labels used for classification.The green pixels were excluded from the calculations, and thus the classification was performed using the remaining 4 labels across a total of 13 images, as shown in Figure 6.

Aerial Images
We have utilized various types of drones for sensing the vertical distributions of CO 2 [58], horizontal distributions of particulate matter [59], and crops in rice paddy fields [60].For this study, aerial images were obtained using the DJI Mavic 2 Pro, which is a compact drone manufactured by DJI.The process of capturing the images is depicted in Figure 7.The scale of the outcrop can be visually compared with the size of the two individuals captured in the photograph.Among the collected images, one specific image was chosen for evaluation, as depicted in Figure 8a.We divided this image into 64 equal-sized rectangles to make it suitable for segmentation.To facilitate the evaluation process, geological experts provided annotations for five specific labels on the image, as illustrated in Figure 8b.The image used for evaluation had a resolution of 5464 × 3640 pixels.The annotation data applied to this image followed the same criteria as the original dataset, performed by domain experts.The trained models used for inference were trained using the OASIS extended dataset for DeepLabv3+ and SegFormer.During inference, the aerial image was divided into an 8 × 8 grid and each sub-image was used as input.Consequently, the input size for each sub-image was 683 × 455 pixels.The dataset consisted of a total of 64 sub-images resulting from the division.
Table 13 presents the class-wise Intersection over Union (IoU) of each image.The scores are arranged in descending order of mIoU.Although there are 13 images, there is a 2-fold difference in accuracy.Additionally, "-" indicates images that do not contain the Black label.While the individual class-wise IoU for Black is high, it represents the average value across four images.The Red class has the lowest IoU.Table 14 shows the correlation coefficients between the pixel occupancy ratio and ranking of each image.Note that "green" is not included as it does not affect the accuracy.The ranking of mIoU is based on the accuracy order of the average mIoU for both methods.The correlation coefficient represents the correlation between the ranking of the class's pixel occupancy ratio and the ranking of mIoU.Negative correlation was observed for "red".This can be attributed to the low overall pixel occupancy ratio of the Red dataset, which is 13.12%.As the pixel occupancy ratio of Red in the test data increases, the pixel occupancy ratio of Red in the training data decreases, leading to a decrease in accuracy due to insufficient data.
Negative correlation was also observed for "Cyan", which is believed to be for the same reasons as red.On the other hand, strong positive correlation was observed for "Yellow".This is because the overall pixel occupancy ratio of the Yellow dataset is high at 45.22%, and the training data is sufficient.Therefore, as the pixel occupancy ratio of Yellow in the test data increases, the accuracy improves.No correlation was found for "Black".Hence, it can be inferred that the imbalance in pixel occupancy ratio affects the accuracy.This is likely due to data insufficiency, indicating the need for techniques such as data augmentation to balance the pixel occupancy ratios.Figure 9 illustrates the segmentation results.Table 15 presents the compared mIoU for each class.In terms of IoU, SegFormer demonstrates superiority across all classes.The overall IoU shows a difference of 8.18%.The largest accuracy difference is observed for the Black class, while the smallest difference is observed for the Yellow class.Confusion matrices are widely used in deep learning for evaluating the performance of classification models [61].Due to significant variations in accuracy across images, we present the confusion matrices for image 13, which has the highest accuracy, and image 8, which has the lowest accuracy, in Figure 10, as depicted in Table 14.The Confusion Matrix reveals that the accuracy of SegFormer, compared to DeepLabv3+, is 40% higher for the Black class and 16% higher for the Cyan class.This difference in accuracy contributes to the discrepancy in mIoU.
These results unequivocally demonstrate the superior performance of SegFormer in semantic segmentation compared to DeepLabv3+.The higher IoU scores obtained by SegFormer indicate its capability to better capture object boundaries and classify pixels accurately.This can be attributed to the architecture of SegFormer, which incorporates Transformer-based models, allowing for more effective feature extraction and contextual understanding.The significant accuracy difference observed for the Black class suggests that SegFormer excels in segmenting objects with complex shapes and intricate details.The Black class objects may possess fine textures or indistinct boundaries, and SegFormer's capability to capture such nuances contributes to its superior performance.On the other hand, the minimal difference in accuracy for the Yellow class implies that both models perform similarly in segmenting objects of this class, which may have more distinguishable features or simpler shapes.

Segmentation Results of Aerial Images
In order to broaden the scope of validation and explore new possibilities, we applied OutcropHyBNet to aerial images for segmentation, expanding the range of applications in CCS.The segmentation results are evaluated using the mIoU metric, which assesses the accuracy and consistency of the predicted segmentation masks with respect to the ground truth masks.We applied our model, OutcropHyBNet, which had been trained using ground-level stationary images, to the aerial images in the dataset and obtain segmentation results.The model assigned a semantic label to each pixel, effectively distinguishing and categorizing different objects and regions within the image.The resulting segmented images provide a visual representation of the distinct entities present in the aerial scenes.By presenting the segmentation results obtained using OutcropHyBNet, we aim to demonstrate its effectiveness in segmenting aerial images.
Figure 11 presents the segmentation results obtained by applying DeepLabv3+ and SegFormer to the input image depicted in Figure 8a.The comparison reveals that SegFormer surpasses DeepLabv3+ in effectively capturing fine details and accurately delineating object boundaries.Specifically, a notable distinction can be observed in the segmentation results of the Black class, where SegFormer exhibits significantly improved performance compared to DeepLabv3+.
A comparison of the confusion matrices shown in Figure 12 reveals notable differences in accuracy between the two methods.Specifically, SegFormer achieves a 40% higher accuracy for the Black class and a 16% higher accuracy for the Cyan class compared to DeepLabv3+.These differences in accuracy directly contribute to the observed discrepancy in mean IoU (mIoU) between the two methods.The segmentation results produced by SegFormer exhibit clearer and more accurate delineation of the object classes, particularly for the Black and Cyan classes.On the other hand, DeepLabv3+ tends to produce more fragmented and less precise segmentation outputs.Overall, these figures visually demon-strate the superior performance of SegFormer in terms of accurate and detailed semantic segmentation compared to DeepLabv3+.Table 16 presents the average IoU for each class.It is noteworthy that SegFormer exhibits superior performance in terms of IoU for all classes when compared to DeepLabv3+.Particularly, there is a significant 40% difference in the Black class, which results in a notable 16% difference in mIoU between the two methods.Nevertheless, the mIoU scores for both methods are below 50%, highlighting the need for further improvements to enhance the segmentation accuracy for this aerial image dataset.Examining the results in Table 16, Seg-Former consistently outperforms DeepLabv3+ in capturing the fine details and boundaries of the objects, leading to higher IoU scores.The Black class exhibits the largest disparity, highlighting the difficulty of accurately segmenting this class with DeepLabv3+.On the other hand, SegFormer achieves significantly better results for the Black class, indicating its effectiveness in handling such challenging scenarios.Overall, the results demonstrate that SegFormer provides improved performance in semantic segmentation tasks, especially in capturing detailed structures and enhancing the accuracy of challenging classes.
Figure 13 illustrates the segmentation results for the top three images based on the average mIoU scores of both backbone networks on OutcropHyBNet.These images predominantly capture the central regions of the scene.This suggests that the models have successfully captured the patterns and can generalize well to unknown images.On the other hand, the bottom images predominantly contain only the outer regions with the Black class.This indicates a potential deviation in the characteristics of the Black class compared to the original dataset.To address this issue, some of the images underwent re-annotation by experts.Table 17 presents the mIoU results after re-annotation.The fourth and sixth columns denote the differences ∆ in comparison to the results obtained from the initial annotation, illustrating the changes resulting from the re-annotation process.In all conditions except for SegFormer in the 1st row and 7th column, clear improvements in accuracy are observed after re-annotation.For SegFormer in the 1st row and 7th column, the model predicted the regions that turned from Cyan to Black after re-annotation as Black, resulting in a slight improvement of less than 1% in accuracy.It can be concluded that the performance improvement was limited in this case.These results suggest the potential of deep learning models to suggest re-evaluation of annotations by humans, as they can contribute to the improvement of accuracy in semantic segmentation tasks.Figure 14 presents the segmentation results of three images from Table 17 after the re-annotation process.In comparison to the results depicted in Figure 11, the colored labels in Figure 14 have been mapped according to the texture in respective stratums.

Conclusions
The objective of this study was to analyze the distribution of geological strata through the application of segmentation techniques on geological outcrop images, facilitating a comprehensive understanding of their spatial arrangement.We proposed OutcropHyB-Net, which leverage DeepLabv3+ and SegFormer for semantic segmentation, along with OASIS for data augmentation.We conducted evaluations and comparisons of the classification performance and accuracy of both models across different classes using two publicly available benchmark datasets.In our preliminary experiments, we presented compelling evidence of the enhanced performance of DeepLabv3+ in classes heavily reliant on textures, particularly in the context of stuff classes.The superiority of DeepLabv3+ in accurately classifying textures within the dataset was observed to a significant extent, substantiating its effectiveness in such scenarios.In the evaluation experiments using our original datasets, we revealed that for non-standard objects with ambiguous shapes resembling geological strata, where classification depended on texture, CNNs exhibited superiority.Our study revealed that SegFormer outperformed other models in scenarios with limited data availability.Additionally, we identified that imbalanced class distributions had a notable impact on the accuracy of the models.Notably, we found that employing class balancing techniques resulted in enhanced accuracy for DeepLabv3+ compared to SegFormer.Moreover, our findings revealed that the utilization of OASIS for data augmentation significantly contributed to enhanced accuracy.By incorporating OASIS into the training process, we observed improved precision and performance in the classification task, highlighting the effectiveness of data augmentation techniques in enhancing the overall accuracy of the models.In the evaluation experiments conducted on ground-level images obtained using a stationary camera and aerial images obtained using a drone, we successfully demonstrated the superior performance of SegFormer across all classes.The comprehensive analysis revealed that SegFormer consistently outperformed other models in accurately classifying various objects and features present in the aerial images, highlighting its effectiveness and superiority in this specific context.
Our future endeavors encompass several challenges, including the augmentation of diversity through the collection of aerial images from various sources and types.By expanding our dataset to include a broader range of aerial images, we aim to improve the robustness and generalization capabilities of our models.Additionally, we plan to further explore and enhance data augmentation techniques to augment the diversity within the existing dataset, thereby fostering more comprehensive and representative training samples.Moreover, we will explore methods to improve the reproducibility of texture and color in image generation using GANs and DMs.We will also propose annotation modifications based on the inference results to further improve accuracy.

Figure 3 .
Figure 3.Comparison of CNN-based and ViT-based methods in terms of accuracy at different sampling numbers.

Figure 4 .
Figure 4. Accuracy trends and distributions for all backbones and top two models.

Figure 6 .
Figure 6.Original and annotation images of our custom dataset.

Figure 7 .
Figure 7. Process of capturing aerial images with the involvement of geological experts and a drone.
(a) Original image (b) GT image

Figure 8 .
Figure 8. Selected aerial image for evaluation.

Figure 9 .
Figure 9. Segmentation results with DeepLabv3+ (first and third rows) and SegFormer (second and fourth rows).
(a) Image 13 with the highest accuracy (b) Image 8 with the lowest accuracy

Figure 10 .
Figure 10.Confusion matrices of the highest and lowest accuracies.

Figure 11 .
Figure 11.Segmentation results obtained from both backbone networks for aerial images.

Figure 12 .
Figure 12.Confusion matrices for both segmentation results.

Figure 13 .
Figure 13.Segmentation results of the top three images.

Figure 14 .
Figure 14.Segmentation results of three images after re-annotation.

Table 1 .
Calculated weights based on pixel occupancy ratio for class balancing.

Table 2 .
Comparison of class balancing results for both models [%].

Table 3 .
Parameters for OASIS.

Table 6 .
Model configurations for semantic segmentation.

Table 10 .
The relationship between geological lithology, grain size, grain sorting, and annotation colors.

Table 11 .
Image size and number.

Table 12 .
Pixel frequency for each class.

Table 14 .
Correlation between the pixel occupancy ratio and ranking of each image [%].